Delivery Time Prediction

AIM: To predict the time taken to deliver the item given information about the delivery agent, distance and other external factors.

This problem falls under the purview of supervised machine learning. The label that we want to predict is a continuous number, so it is a regression problem.

Importing important libraries

Data

Function to help in loading the data

Load the training and test data

Splitting the train-test data

Datatype of the features

EDA

Features in the Data and their expected data type

  1. object represents strings.
  2. The datatype of every feature is initially set to object.
  3. Processing requires setting each feature to its correct datatype.
  4. The features can be assigned to the following datatypes:
Date time Floats Category
Delivery_person_ID
Delivery_person_Age
Delivery_person_Ratings
Restaurant_lattitude
Restaurant_longitude
Delivery_location_lattitude
Delivery_location_longitude
Order_Date
Time_Ordered
Time_Order_Picked
Weather conditions
Road_traffic_density
Vechicle_condition
Type_of_order
Type_of_vehicle
Multiple_deliveries
festival
City
  1. In doing so, NA values need to be converted to np.nan.
  2. Find the number of invalid values in the data.
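
The two steps above can be sketched as follows. This is a minimal illustration, not the notebook's original helper: the function name `replace_na_strings`, the list of NA-like tokens, and the toy values are all assumptions; only the column names come from the feature list.

```python
import numpy as np
import pandas as pd

# Toy frame: column names come from the feature list, the values are invented.
raw = pd.DataFrame(
    {
        "Delivery_person_Age": ["34", "NaN ", "29"],
        "Delivery_person_Ratings": ["4.5", "4.9", "NaN "],
        "City": ["Urban", "NaN ", "Semi-Urban"],
    },
    dtype="object",
)


def replace_na_strings(df, na_tokens=("NaN", "NA", "nan", "")):
    """Strip whitespace from string cells and map NA-like tokens to np.nan.

    Hypothetical helper; the set of tokens is an assumption about the raw file.
    """

    def _clean(value):
        if isinstance(value, str):
            value = value.strip()
            if value in na_tokens:
                return np.nan
        return value

    return df.apply(lambda col: col.map(_clean))


cleaned = replace_na_strings(raw)

# Number of invalid (missing) values per feature.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```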

Useful functions

Datatype conversion

Before modelling the data and making predictions, we need to convert the features into appropriate datatypes.

  1. Convert the features into the correct datatype.
  2. Find the number of missing values.
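
A conversion along these lines might look like the sketch below. The grouping of columns into float and category lists, the `%d-%m-%Y` date format, and the sample values are assumptions made for illustration; the real data may need different choices.

```python
import pandas as pd

# Toy frame with every column stored as object, as described above.
df = pd.DataFrame(
    {
        "Delivery_person_Age": ["34", None, "29"],
        "Order_Date": ["19-03-2022", "25-03-2022", "05-04-2022"],
        "City": ["Urban", "Semi-Urban", None],
    },
    dtype="object",
)

float_cols = ["Delivery_person_Age"]  # assumed numeric features
category_cols = ["City"]              # assumed categorical features

typed = df.copy()

# Numeric features: unparseable entries become NaN instead of raising.
for col in float_cols:
    typed[col] = pd.to_numeric(typed[col], errors="coerce")

# Date feature: the day-month-year format is an assumption about the raw file.
typed["Order_Date"] = pd.to_datetime(
    typed["Order_Date"], format="%d-%m-%Y", errors="coerce"
)

# Categorical features.
for col in category_cols:
    typed[col] = typed[col].astype("category")

print(typed.dtypes)
print(typed.isna().sum())  # missing values per feature after conversion
```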

Exploratory analysis

Delivery Person ID

Numerical Features

Coordinate data

Issues

  1. Restaurant location data:
    1. Some restaurant locations have both longitude and latitude at 0 (not possible).
    2. Restaurant locations with negative longitude and latitude are placed incorrectly; how to correct them needs further investigation.
  2. Delivery location data:
    1. Delivery latitude and longitude also contain 0 values (not possible).
    2. These values can be corrected by assuming that 0 actually represents a null value.

Plotting only the data points that are problematic in the Restaurant location

It seems that the restaurant location issue is caused by an incorrect sign in front of the latitude and longitude values. One can simply take the absolute value of each such data point.
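
Both fixes (absolute values for the sign problem, null for the zero coordinates) could be applied as in this sketch. The coordinate values are invented, and the approach assumes that every valid coordinate in the data is positive, which is what the sign-flip explanation implies.

```python
import numpy as np
import pandas as pd

# Invented examples: one valid row, one sign-flipped row, one zero row.
coords = pd.DataFrame(
    {
        "Restaurant_lattitude": [22.74, -27.16, 0.0],
        "Restaurant_longitude": [75.89, -78.04, 0.0],
    }
)

coord_cols = ["Restaurant_lattitude", "Restaurant_longitude"]

fixed = coords.copy()
# Undo the suspected sign error, then treat exact zeros as missing values.
fixed[coord_cols] = fixed[coord_cols].abs().replace(0.0, np.nan)

print(fixed)
```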

Categorical Data

  1. The average delivery time on a sunny day is lower than under other weather conditions.
  2. The average delivery time is similar across all order types.
  3. The average delivery time in a Semi-Urban city is higher than in an Urban or Metropolitan city.
  4. The average delivery time is lower when the traffic density is low.
  5. The average delivery time is similar across all vehicle types.
  6. The average delivery time is higher during a festival.
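
Comparisons like these come from grouping the target by each categorical feature and averaging. The sketch below shows the idea on invented numbers; the target column name `Time_taken` and the helper `mean_time_by` are assumptions, and the toy values are not results from the actual data.

```python
import pandas as pd

# Invented data, only to show the mechanics of the comparison.
orders = pd.DataFrame(
    {
        "Road_traffic_density": ["Low", "Jam", "Low", "High", "Jam"],
        "festival": ["No", "Yes", "No", "No", "Yes"],
        "Time_taken": [18.0, 45.0, 22.0, 30.0, 41.0],
    }
)


def mean_time_by(df, feature, target="Time_taken"):
    """Average target value per category, sorted from fastest to slowest."""
    return df.groupby(feature)[target].mean().sort_values()


for feature in ["Road_traffic_density", "festival"]:
    print(mean_time_by(orders, feature), end="\n\n")
```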

Baseline

Replace a few values with NAs

Pipeline for feature extraction and modelling

Using Gradient Boosting Regressor as a baseline model
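
A baseline of this kind could be sketched with scikit-learn as shown below. Only the choice of Gradient Boosting Regressor comes from the text; the median imputation, the one-hot encoding, the column subset, the synthetic data, and the use of mean absolute error are assumptions made so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the delivery data (values are invented).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    {
        "Delivery_person_Age": rng.integers(20, 40, n).astype(float),
        "Delivery_person_Ratings": rng.uniform(3.0, 5.0, n),
        "Road_traffic_density": rng.choice(["Low", "Medium", "High", "Jam"], n),
    }
)
X.loc[::25, "Delivery_person_Age"] = np.nan  # simulate a few missing values
y = 20 + 5 * (X["Road_traffic_density"] == "Jam") + rng.normal(0, 2, n)

numeric_cols = ["Delivery_person_Age", "Delivery_person_Ratings"]
categorical_cols = ["Road_traffic_density"]

# Feature extraction: impute numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

model = Pipeline(
    [
        ("preprocess", preprocess),
        ("regressor", GradientBoostingRegressor(random_state=0)),
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Baseline MAE on held-out data: {mae:.2f}")
```

Wrapping the preprocessing and the regressor in one `Pipeline` keeps the imputation statistics and encoder categories fitted on the training split only, which avoids leaking information from the test split.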

Feature Engineering

Using existing information to generate more features.

Delivery agent information

Date time Data

  1. Order_Date: gives the day, month and week in which the order was placed, which in turn can help find the relationship between time_taken and the day (traffic patterns on weekdays and weekends, time of the day).
  2. Time_Ordered: can be dropped, since the time taken to deliver probably depends more on the time the order was picked up.
  3. Time_Order_Picked: can be used, but it needs to be cleaned before converting it into a pandas datetime object (a few observations are incorrect).
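
The features described above might be derived as in this sketch. The date and time formats, the derived column names, and the malformed sample value are assumptions; the text does not say what the incorrect observations look like, so here anything unparseable is simply coerced to a missing value.

```python
import pandas as pd

# Invented sample rows; the last pick-up time is deliberately malformed.
orders = pd.DataFrame(
    {
        "Order_Date": ["19-03-2022", "21-03-2022", "22-03-2022"],
        "Time_Order_Picked": ["11:45:00", "21:10:00", "not a time"],
    }
)

order_date = pd.to_datetime(orders["Order_Date"], format="%d-%m-%Y")

# errors="coerce" turns unparseable pick-up times into NaT instead of raising.
picked = pd.to_datetime(
    orders["Time_Order_Picked"], format="%H:%M:%S", errors="coerce"
)

features = pd.DataFrame(
    {
        "order_day": order_date.dt.day,
        "order_month": order_date.dt.month,
        "order_weekday": order_date.dt.dayofweek,  # Monday=0 ... Sunday=6
        "is_weekend": order_date.dt.dayofweek >= 5,
        "picked_hour": picked.dt.hour,  # NaN where the time was invalid
    }
)

print(features)
```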

Distance data

Delivery agent information

Time Date Information

Distance
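
One plausible way to turn the restaurant and delivery coordinates into a single distance feature is the haversine (great-circle) distance. The text does not state which distance measure was used, so the formula choice, the function name `haversine_km`, and the Earth radius constant are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees.

    Accepts scalars or NumPy arrays / pandas Series of equal length.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Invented coordinates for two restaurant/delivery pairs.
restaurant_lat = np.array([22.74, 12.91])
restaurant_lon = np.array([75.89, 77.68])
delivery_lat = np.array([22.76, 13.04])
delivery_lon = np.array([75.91, 77.81])

distance_km = haversine_km(
    restaurant_lat, restaurant_lon, delivery_lat, delivery_lon
)
print(distance_km)
```

Because the function is vectorised, it can be applied directly to the four coordinate columns of a DataFrame to create the new feature in one call.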

Removing unnecessary columns

Fitting the new data set with new features

Fitting a Gradient Boosting Regressor

Conclusion

Using the given data and the tools at hand, one can train a simple gradient boosting regressor to predict the delivery time. The predictions were made using historical data. A combination of hyperparameter tuning, ensembling and stacking could improve upon the results obtained here.